Profiling relational data: a survey
Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and it is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
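The simpler single-column statistics the survey mentions (null counts, distinct counts, frequent values) can be sketched in a few lines. The function name and the sample data below are invented for illustration and are not from the survey:

```python
from collections import Counter

def profile_column(values):
    """Compute basic single-column profiling statistics:
    null count, distinct count, and the most frequent value."""
    nulls = sum(1 for v in values if v is None)
    non_null = [v for v in values if v is not None]
    most_common = Counter(non_null).most_common(1)
    return {
        "nulls": nulls,
        "distinct": len(set(non_null)),
        "top_value": most_common[0][0] if most_common else None,
    }

stats = profile_column(["a", "b", "a", None, "a"])
print(stats)  # {'nulls': 1, 'distinct': 2, 'top_value': 'a'}
```

The multi-column metadata the survey covers (unique column combinations, functional and inclusion dependencies) require searching over sets of columns and are correspondingly harder to compute.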
AutoML in Heavily Constrained Applications
Optimizing a machine learning pipeline for a task at hand requires careful
configuration of various hyperparameters, typically supported by an AutoML
system that optimizes the hyperparameters for the given training dataset. Yet,
depending on the AutoML system's own second-order meta-configuration, the
performance of the AutoML process can vary significantly. Current AutoML
systems cannot automatically adapt their own configuration to a specific use
case. Further, they cannot comply with user-defined application constraints on the
effectiveness and efficiency of the pipeline and its generation. In this paper,
we propose Caml, which uses meta-learning to automatically adapt its own AutoML
parameters, such as the search strategy, the validation strategy, and the
search space, for a task at hand. The dynamic AutoML strategy of Caml takes
user-defined constraints into account and obtains constraint-satisfying
pipelines with high predictive performance.
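The constraint-satisfying selection described in the abstract can be illustrated by a minimal sketch: among candidate pipelines, discard those violating a user-defined efficiency constraint, then pick the most accurate survivor. All names, metrics, and the threshold below are invented; this is not the actual Caml search strategy:

```python
def select_pipeline(candidates, max_inference_ms):
    """Pick the most accurate candidate pipeline that satisfies a
    user-defined efficiency constraint (maximum inference time)."""
    feasible = [c for c in candidates if c["inference_ms"] <= max_inference_ms]
    if not feasible:
        return None  # no constraint-satisfying pipeline exists
    return max(feasible, key=lambda c: c["accuracy"])

candidates = [
    {"name": "gbm_large", "accuracy": 0.91, "inference_ms": 40},
    {"name": "logreg",    "accuracy": 0.84, "inference_ms": 2},
    {"name": "gbm_small", "accuracy": 0.88, "inference_ms": 9},
]
best = select_pipeline(candidates, max_inference_ms=10)
print(best["name"])  # gbm_small
```

Caml goes further by using meta-learning to choose the search strategy, validation strategy, and search space themselves, rather than only filtering finished candidates.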
The Need for Incorporation of the Principles of Fiscal Sociology in Social Policy in Ukraine
The article proposes a new principle for financing social expenditures in a country with an insufficient level of democracy under conditions of economic crisis, which the authors suggest calling the Pareto anti-optimum.
Unsupervised String Transformation Learning for Entity Consolidation
Data integration has been a long-standing challenge in data management with
many applications. A key step in data integration is entity consolidation. It
takes a collection of clusters of duplicate records as input and produces a
single "golden record" for each cluster, which contains the canonical value for
each attribute. Truth discovery and data fusion methods, as well as Master Data
Management (MDM) systems, can be used for entity consolidation. However, to
achieve better results, the variant values (i.e., values that are logically the
same with different formats) in the clusters need to be consolidated before
applying these methods.
For this purpose, we propose a data-driven method to standardize the variant
values based on two observations: (1) the variant values usually can be
transformed to the same representation (e.g., "Mary Lee" and "Lee, Mary") and
(2) the same transformation often appears repeatedly across different clusters
(e.g., transpose the first and last name). Our approach first uses an
unsupervised method to generate groups of value pairs that can be transformed
in the same way (i.e., they share a transformation). Then the groups are
presented to a human for verification and the approved ones are used to
standardize the data. In a real-world dataset with 17,497 records, our method
achieved 75% recall and 99.5% precision in standardizing variant values by
asking a human 100 yes/no questions, clearly outperforming a state-of-the-art
data wrangling tool.
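The grouping idea in the abstract, that the same transformation (e.g., transposing first and last name) recurs across clusters, can be sketched by keying each value pair on the token permutation that maps one value to the other. The function and sample pairs below are invented for illustration, not the paper's actual algorithm:

```python
def shared_token_permutation(a, b):
    """Return a token-level permutation key if b's tokens are a
    reordering of a's tokens (ignoring commas), else None."""
    ta = a.replace(",", " ").split()
    tb = b.replace(",", " ").split()
    if sorted(ta) != sorted(tb):
        return None
    # For each token of b, record its position in a.
    return tuple(ta.index(t) for t in tb)

pairs = [("Mary Lee", "Lee, Mary"),
         ("John Smith", "Smith, John"),
         ("Acme Inc", "Acme Incorporated")]

# Group value pairs that share the same transformation.
groups = {}
for a, b in pairs:
    key = shared_token_permutation(a, b)
    if key is not None:
        groups.setdefault(key, []).append((a, b))
print(groups)  # {(1, 0): [('Mary Lee', 'Lee, Mary'), ('John Smith', 'Smith, John')]}
```

In this sketch the two name pairs fall into one group (the "swap the two tokens" transformation), which could then be shown to a human for a single yes/no verification, while the abbreviation pair is left ungrouped.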
Duplicate Table Detection with Xash
Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive, as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, to the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which acts like a Bloom filter and instantly indicates the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
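The super-key idea can be illustrated with a deliberately simplified stand-in (this is not the actual Xash construction): OR one hash bit per cell value into a bitmask, so that a missing bit proves a value is absent, while a present bit only suggests it may be contained, exactly the Bloom-filter-style pruning the abstract describes:

```python
def super_key(values, bits=64):
    """Build a Bloom-filter-like bitmask by OR-ing one hash bit per value.
    Simplified stand-in for Xash; the real function is more elaborate."""
    mask = 0
    for v in values:
        mask |= 1 << (hash(v) % bits)
    return mask

def may_contain(mask, value, bits=64):
    """A set bit is necessary (but not sufficient) for membership:
    False means definitely absent; True means possibly present."""
    return mask & (1 << (hash(value) % bits)) != 0

row = ("Alice", "Berlin", "1990")
mask = super_key(row)
# No false negatives: every value actually in the row tests positive.
assert all(may_contain(mask, v) for v in row)
```

Candidate rows or tables whose masks lack required bits can be discarded without any cell-by-cell comparison, which is where the speed-up comes from.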
Advancing the discovery of unique column combinations
Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known Gordian algorithm [9] and “Apriori-based” algorithms [4] are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution, HCA-Gordian, combines the advantages of Gordian and our new algorithm HCA, and it outperforms all previous work in many situations.
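The Apriori-style lattice search underlying such algorithms can be sketched as follows: grow column sets level by level and prune any superset of an already-unique set, since it cannot be minimal. This is a minimal illustration with invented data, not the optimized HCA candidate generation:

```python
from itertools import combinations

def minimal_uccs(rows, n_cols):
    """Bottom-up, Apriori-style search for minimal unique column
    combinations: a column set is unique if its projection has no
    duplicate tuples; supersets of unique sets are pruned."""
    def is_unique(cols):
        projected = [tuple(r[c] for c in cols) for r in rows]
        return len(set(projected)) == len(rows)

    found = []
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            if any(set(m) <= set(cols) for m in found):
                continue  # a subset is already unique -> not minimal
            if is_unique(cols):
                found.append(cols)
    return found

rows = [("a", 1, "x"), ("a", 2, "x"), ("b", 1, "y")]
print(minimal_uccs(rows, 3))  # [(0, 1), (1, 2)]
```

No single column is unique here, but columns {0,1} and {1,2} each identify every row; the full set {0,1,2} is pruned because its subsets are already unique. Real discovery algorithms add far more aggressive pruning and memory-efficient data structures to scale beyond toy inputs.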
SPRINT: Ranking Search Results by Paths
Graph-structured data abounds and has become the subject of much attention in the past years, for instance when searching and analyzing social network structures. Measures such as the shortest path or the number of paths between two nodes are used as proxies for similarity or relevance [1]. These approaches benefit from the fact that the measures are determined from some context node, e.g., “me” in a social network. With SPRINT, we apply these notions to a new domain, namely ranking web search results using the link-path structure among pages. SPRINT demonstrates the feasibility and effectiveness of Searching by Path Ranks on the INTernet with two use cases: First, we re-rank intranet search results based on the position of the user’s homepage on the graph. Second, as a live proof of concept, we dynamically re-rank Wikipedia search results based on the currently viewed page: when viewing the Java software page, a search for “Sun” ranks Sun Microsystems higher than the star at the center of our solar system. We evaluate the first use case with a user study. The second use case is the focus of the demonstration and allows users to actively test our system with any combination of context page and search term.
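The context-node re-ranking described above can be sketched with a plain breadth-first search: compute hop distances from the currently viewed page and sort search results by distance. The tiny link graph below is invented to mirror the Java/Sun example; it is not SPRINT's actual ranking function:

```python
from collections import deque

def bfs_distances(graph, start):
    """Shortest-path distance (in hops) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in graph.get(node, []):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

# Invented link graph: pages and their outgoing links.
graph = {
    "Java": ["Sun Microsystems", "Programming language"],
    "Sun Microsystems": ["Java"],
    "Solar System": ["Sun (star)"],
    "Sun (star)": ["Solar System"],
}
results = ["Sun (star)", "Sun Microsystems"]
d = bfs_distances(graph, "Java")
# Unreachable pages sort last; closer pages rank higher.
ranked = sorted(results, key=lambda p: d.get(p, float("inf")))
print(ranked)  # ['Sun Microsystems', 'Sun (star)']
```

With "Java" as the context node, Sun Microsystems is one hop away while the star is unreachable in this toy graph, reproducing the re-ranking behavior the abstract describes.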